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Abstract —We consider distributed optimization where N nodes 
in a connected network minimize the sum of their local costs 
subject to a common constraint set. We propose a distributed 
projected gradient method where each node, at each iteration k, 
performs an update (is active) with probability pu, and stays idle 
(is inactive) with probability 1 — p k . Whenever active, each node 
performs an update by weight-averaging its solution estimate 
with the estimates of its active neighbors, taking a negative 
gradient step with respect to its local cost, and performing a 
projection onto the constraint set; inactive nodes perform no 
updates. Assuming that nodes’ local costs are strongly convex, 
with Lipschitz continuous gradients, we show that, as long 
as activation probability pk grows to one asymptotically, our 
algorithm converges in the mean square sense (MSS) to the same 
solution as the standard distributed gradient method, i.e., as if all 
the nodes were active at all iterations. Moreover, when pk grows 
to one linearly, with an appropriately set convergence factor, the 
algorithm has a linear MSS convergence, with practically the 
same factor as the standard distributed gradient method. Simula¬ 
tions on both synthetic and real world data sets demonstrate that, 
when compared with the standard distributed gradient method, 
the proposed algorithm significantly reduces the overall number 
of per-node communications and per-node gradient evaluations 
(computational cost) for the same required accuracy. 

Index Terms —Distributed optimization, distributed gradient 
method, variable number of working nodes, convergence rate, 
consensus. 


I. Introduction 

We consider distributed optimization where N nodes con¬ 
stitute a generic, connected network, each node i has a convex 
cost function /,; : i —> K known only by i, and the nodes 

want to solve the following problem: 

minimize M x ) ='■ f ( x ) (1 ) 

subject to x £ X. 

Here, x £ is the optimization variable common to all 
nodes, and X C is a closed, convex constraint set, 
known by all. The above and related problems arise frequently, 
e.g., in big data analytics in cluster or cloud environments, 
e.g., [l]-[3], distributed estimation in wireless sensor networks 
(WSNs), e.g., [4]-[8], and distributed control applications, 
e.g., [9], [10]. With all the above applications, data is split 
across multiple networked nodes (sensors, cluster machines, 
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etc.), and /*( x) = fi(x\Di) represents a loss with respect to 
data Di stored locally at node i. 

A popular approach to solve (1) is via distributed (pro¬ 
jected) (sub)gradient methods, e.g., [11], [12], [13]. With 
these methods, each node i, at each iteration k, updates its 
solution estimate by weight-averaging it with the estimates of 
its neighbors, taking a negative gradient step with respect to its 
local cost, and projecting the result onto the constraint set X. 
Distributed gradient methods are attractive as they do not 
require centralized coordination, have inexpensive iterations 
(provided that projections onto X are computationally light), 
and exhibit resilience to inter-node communication failures and 
delays; however, they have a drawback of slow convergence 
rate. 

Several techniques to improve convergence rates of dis¬ 
tributed (projected) gradient methods have been proposed, 
including Newton-like methods, e.g., [14], [15], [16], [17], and 
Nesterov-like methods, e.g., [18], [19]. In this paper, we make 
distributed (projected) gradient methods more efficient by 
proposing a novel method with a variable number of working 
nodes. Each node i, at each iteration k, performs an update 
(is active) with probability p*,, and stays idle (is inactive) with 
probability 1 —pk, while the activation schedule is independent 
both across nodes and across iterations. Whenever active, 
each node i performs the same update as with the standard 
distributed gradient method, while inactive nodes perform no 
updates. 

Our main results are as follows. Assuming that the costs /,’s 
are strongly convex and their gradients are Lipschitz contin¬ 
uous, we show that, whenever the activation probability pk 
grows asymptotically to one, our method converges in the 
mean square sense to the same point as the standard distributed 
gradient method. 1 If, in addition, pk converges to unity at a 
sublinear rate at least as fast as l//c 1+ ^ (£ > 0 arbitrarily 
small), the method also converges almost surely. 

Our second group of results assumes that pk grows to one 
linearly, with the convergence factor S £ (0,1). In this setting, 
we show that the proposed algorithm has a linear convergence 
rate (in the sense of the expected distance to the solution). 
When, in addition, quantity S is set in accordance with the 
fi s condition number, we show that the proposed algorithm 
converges practically with the same linear convergence factor 
as the standard distributed gradient method (albeit with a 
larger hidden constant). Hence, interestingly, our algorithm 

'Under a constant step-size a, the standard (projected) distributed gradient 
method converges to a point in a neighborhood of the solution of (1), where 
the corresponding squared distance is O(a); see ahead Theorem 1 and, e.g., 
[ 20 ], [ 21 ], 



2 


achieves practically the same rate in iterations k as the 
standard distributed gradient method, but with the reduced 
cost per iteration k (overall communication and computational 
cost), thus making distributed gradient methods more efficient. 
Simulation examples on 1-2 -regularized logistic losses - both 
on synthetic and real world data sets - confirm that our method 
significantly reduces the communication and computational 
costs with respect to the standard distributed gradient method, 
for the same desired accuracy. Simulations also demonstrate 
that the proposed method exhibits a high degree of resilience 
to asynchrony. 

Communication and computational savings are highly rel¬ 
evant with applications like WSNs and distributed learning 
in cluster or cloud environments. With WSNs, the reduced 
communication and computational cost to retrieve the result 
translate into energy saving of the sensor motes’ batteries 
and the increase of the network lifetime. With distributed 
learning in cluster or cloud environments, less amount of 
communication and computation for a specific application/task 
means that the saved resources can be re-allocated to another 
concurrent tasks. For example, at times when a node with our 
method is idle, the resources allocated to it (e.g., a virtual cloud 
machine) can be released and re-allocated to other tasks. 

We explain intuitively the above results that we achieve. 
Namely, standard distributed gradient method exhibits, in 
a sense, two sources of redundancy-the first corresponds 
to the inter-node communications aspect, while the second 
corresponds to an optimization aspect (number of gradient 
evaluations per iteration) of the algorithm. It turns out that, 
as we show here, a careful simultaneous exploitation of these 
two redundancies allows to match the rate of the standard 
distributed gradient method with reduced communications 
and computational costs. The two sources of redundancy 
have been already noted in the literature but have not been 
exploited simultaneously before. The communication redun¬ 
dancy, e.g., [22] means that the inter-node communications 
can be “sparsified,” e.g., through a randomized protocol, so 
that the algorithm still remains convergent. In other words, 
it is not necessary to utilize communications through all the 
available links at all iterations for the algorithm to converge. 
The optimization redundancy has been previously studied 
only in the context of centralized optimization, e.g., [23], 
Increasing the probability pk here resembles increasing the 
size of the batch used along iterations in [23], The core idea 
is that, under certain assumptions on the cost functions, a 
(centralized) stochastic-type gradient method with an appro¬ 
priately increasing sample size matches the convergence rate 
of the standard gradient method with the full sample size 
at all iterations, as shown in [23], More broadly, the idea 
to use “increasingly-accurate” optimization steps in iterative 
optimization algorithms has been widely considered in the 
(centralized) optimization literature, including, e.g., inexact 
Newton methods [24]. 

Finally, although the focus of the paper is on the standard 
distributed gradient method [25], we believe that the proposed 
idling strategy is of a more general applicability. An interesting 
future research direction is to incorporate it in other distributed 
methods, including, e.g., [26], [18], [19], [27], [14], 


A. Related work 

We now briefly review existing work relevant to our con¬ 
tributions to help us further contrast our work from the 
literature. We divide the literature into two classes: 1) dis¬ 
tributed gradient methods for multi-agent optimization; and 
2) centralized stochastic approximation methods with variable 
sample sizes. The former class relates to our work through 
the communication redundancy, while the latter considers the 
optimization redundancy. 

Distributed gradient methods for multi-agent optimiza¬ 
tion. Distributed methods of this type date back at least to 
the 80s, e.g., [28], and have received renewed interest in the 
past decade, e.g., [11]. Reference [11] proposes the distributed 
(sub)gradient method with a constant step-size, and analyzes 
its performance under time-varying communication networks. 
Reference [22] considers distributed (sub)gradient methods 
under random communication networks with failing links and 
establishes almost sure convergence under a diminishing step- 
size rule. A major difference of our paper from the above 
works is that, in [28], [11], [22], only inter-node commu¬ 
nications over iterations are “sparsified,” while each node 
performs gradient evaluations at each iteration k. In [13], the 
authors propose a gossip-like scheme where, at each k, only 
two neighboring nodes in the network wake up and perform 
weight-averaging (communication among them) and a negative 
gradient step with respect to their respective local costs, while 
the remaining nodes stay idle. The key difference with respect 
to our paper is that, with our method, the number of active 
nodes over iterations k (on average) is increasing, while in [13] 
it remains equal to two for all k. Consequently, the established 
convergence properties of the two methods are very different. 

There have been many works where nodes or links in the 
network are controlled by random variables. References [29], 
[30], [31], [32] consider distributed algorithms for solving the 
consensus problem - finding an average of nodes’ local scalars 
a.j’s, while we consider here a more general problem (1). These 
consensus algorithms involve only local averaging steps, and 
no local gradient steps are present (while we have here both 
local averaging and local gradient steps). The models of av¬ 
eraging (weight) matrices which [29], [30], [31], [32] assume 
are very different from ours: they all assume random weight 
matrices with time-invariant distributions, while ours are time- 
varying. Reference [33] studies diffusion algorithms under 
changing topologies and data-normalized algorithms, under 
general, non-Gaussian distributions. Reference [34] proposes a 
control mechanism for link activations in diffusion algorithms 
to minimize the estimation error under given resource con¬ 
straints. The main differences with respect to our paper are 
that [34] assumes that local gradients are always incorporated 
(deterministic step-sizes), and the link activation probabilities 
are time invariant. 

References [35], [36], [37] provide a thorough and in-depth 
analysis of diffusion algorithms under a very general model 
of asynchrony, where both the combination (weight) matrices 
and nodes’ step-sizes are random. Our work differs from these 
references in several aspects, which include the following. A 
major difference is that papers [35], [36], [37] assume that both 
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the step sizes’and the combination matrices’ random processes 
have constant (time-invariant) first and second moments, and 
the two processes are moreover mutually independent. In 
contrast, both our weight matrices and step-sizes have time- 
varying distributions. Actually, the fact that, with our method, 
node activation probabilities converge to one (which corre¬ 
sponds to the time-varying first moment of the step-sizes) 
is critical to establish our main results (Theorems 2 and 3). 
Further differences are that papers [35], [36], [37] allow for 
noisy gradients, their nodes’ local cost functions all have 
the same minimizers, and therein the optimization problem 
is unconstrained. In contrast, we assume noise-free gradients, 
different local minimizers, and constrained problems. 

Our paper is also related to reference [38], which considers 
diffusion algorithms with two types of nodes - informed and 
uninformed. The informed nodes both: 1) acquire measure¬ 
ments and perform in-network processing (which translates 
into computing gradients in our scenario); and 2) perform 
consultation with neighbors (which translates into weight¬ 
averaging the estimates across neighborhoods), while the 
uninformed nodes only perform the latter task. The authors 
study the effect of the proportion of informed nodes and their 
distribution in space. A key difference with respect to our 
work is that the uninformed nodes in [38] still perform weight¬ 
averaging, while the idle nodes here perform no processing. 
Finally, we comment on reference [39] which introduces 
an adaptive policy for each node to decide whether it will 
communicate with its neighbors or not and demonstrate sig¬ 
nificant savings in communications with respect to the always- 
communicating scenario. A major difference of [39] from our 
paper is that, with [39], nodes always perform local gradients, 
i.e., they do not stay idle (in the sense defined here). 

Centralized stochastic approximation methods with vari¬ 
able sample sizes have been studied for a long time. We 
distinguish two types of methods: the ones that assume 
unbounded sample sizes (where the cost function is in the 
form of a mathematical expectation) and the methods with 
bounded sample sizes (where the cost function is of the form 
in (1).) Our work contrasts with both of these threads of 
works by considering distributed optimization over an arbitrary 
connected network, while they consider centralized methods. 

Unbounded sample sizes have been studied, e.g., in [40], 
[41], [42], [43], [44]. Reference [40] uses a Bayesian scheme 
to determine the sample size at each iteration within the trust 
region framework, and it shows almost sure convergence to a 
problem solution. Reference [41] shows almost sure conver¬ 
gence as long as the sample size grows sufficiently fast along 
iterations. In [42], the variable sample size strategy is obtained 
as the solution of an associated auxiliary optimization problem. 
Further references on careful analyses of the increasing sample 
sizes are, e.g., [43], [44], 

References [45], [46] consider a trust region framework and 
assume bounded sample sizes, but, differently from our paper 
and [23], [40], [42], [43], [44], they allow the sample size 
both to increase and to decrease at each iteration. The paper 
chooses a sample size at each iteration such that a balance 
is achieved between the decrease of the cost function and 
the width of an associated confidence interval. Reference [47] 


proposes a schedule sequence in the monotone line search 
framework which also allows the sample size both increase 
and decrease at each iteration; paper [48] extends the results 
in [47] to a non-monotone line search. 

Reference [23] is closest to our paper within this thread 
of works, and our work mainly draws inspiration from it. 
The authors consider a bounded sample size, as we do here. 
They consider both deterministic and stochastic sampling and 
determine the increase of the sample size along iterations 
such that the algorithm attains (almost) the same rate as 
if the full sample size was used at all iterations. A major 
difference of [23] with respect to the current paper is that 
they are not concerned with the networked scenario, i.e., 
therein a central entity works with the variable (increasing) 
sample size. This setup is very different from ours as it has 
no problem dimension of propagating information across the 
networked nodes - the dimension present in distributed multi¬ 
agent optimization. 

Paper organization. The next paragraph introduces no¬ 
tation. Section II explains the model that we assume and 
presents our proposed distributed algorithm. Section III states 
our main results which we prove in Section IV. Section V 
provides numerical examples. Section VI provides a discussion 
on the results and gives extensions. Finally, we conclude 
in Section VII. Certain auxiliary proofs are provided in the 
Appendix. 

Notation. We denote by: R the set of real numbers; W 1 
the (-/-dimensional Euclidean real coordinate space; A t j the 
entry in the i-th row and j -th column of a matrix A; 71 1 
the transpose of a matrix A; 0 and 0 the Hadamard (entry- 
wise) and Kronecker product of matrices, respectively; /, 0, 
1, and e t , respectively, the identity matrix, the zero matrix, 
the column vector with unit entries, and the i-th column of I; 
J the N x N matrix J := (l/./V)ll T ; A y 0 {A y 0) means 
that the symmetric matrix A is positive definite (respectively, 
positive semi-definite); ||-j| = ||-||2 the Euclidean (respectively, 
spectral) norm of its vector (respectively, matrix) argument; 
A i(-) the i-th largest eigenvalue, Diag (a) the diagonal matrix 
with the diagonal equal to the vector a: | • | the cardinality 
of a set; Vh(w) the gradient evaluated at w of a function 
h : R d — > R, d > 1; P(_4) and E [u] the probability of an 
event A and expectation of a random variable u, respectively. 
For two positive sequences rj n and \n, we have: r/ n = 0(xn) 
if limsupjj^^ — < oo. Finally, for a positive sequence \n 
and arbitrary sequence £„ = o(x„) if lim^oc, = 0. 

II. Model and algorithm 

Subsection II-A describes the optimization and network 
models that we assume, while Subsection II-B presents our 
proposed distributed algorithm with variable number of work¬ 
ing nodes. 

A. Problem model 

Optimization model. We consider optimization prob¬ 
lem (1), and we impose the following assumptions on (1). 
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Assumption 1 (Optimization model) (a) For all i, fi : i—>• 

K is strongly convex with modulus p > 0, i.e.: 

fi(y ) > fi(x)+yfi(x) T (y-x) + ^\\y-x\\ 2 , Vx,y G M d . 

(b) For all i, f t : i—>• K. has Lipschitz continuous gradient 

with constant L, 0 < p < L < oo, i.e.: 

II V/j(x) - V/j(y)|| < i||ar - y\\, \/x,y G 

(c) The set X C is nonempty, closed, convex, and 
bounded. 

We denote by D := max{||x|| : x G X} the diameter of 
X. Note that, as X is compact, the gradients V/j(x)’s are 
bounded over x £ X, i.e., there exists G > 0, such that, for 
all i, for all x G X, ||V/,(x)|| < G. The constant G can be 
taken as LD + tncex-i—i....,N ||V/;(0)||. Indeed, for any x G X, 
we have: ||V/ i (*)|| < jlV/^x) - V/, ; (0)|| + ||V/,(0)|| < 
L\\x\\ + ||V/j(0)|| < TD + maxj =1> ... )JV ||V/j(0)||. Similarly, 
there exist constants — oo < to/ < Mf < oo, such that to/ < 
fi(x) < Mf, \/i, Vx G X. Constants to/ and Mf can be 
taken as Mf = —rrif = G D + maxj = j a ... i jv |/t(0)|. Under 
Assumption 1, (1) is solvable and has a unique solution, which 
we denote by x*. 

Network model. Nodes are connected in a generic undi¬ 
rected network Q = (V,E), where V is the set of N nodes 
and E is the set of edges - all node (unordered) pairs { i, j } 
that can exchange messages through a communication link. 
We impose the following assumption. 

Assumption 2 (Network connectedness) The network Q = 
(V, E) is connected, undirected, and simple (no self-loops nor 
multiple links). 

Both Assumptions 1 and 2 hold throughout the paper. We 
denote by f \ the neighborhood set of node i (excluding i). 
We associate with Q a N x N symmetric weight matrix C, 
which is also stochastic (rows sum to one and all the entries 
are non-negative). We let Cij be strictly positive for each 
{i,j} G E, i ± j\ Cij = 0 for {i,j} (£ E, i ^ j; and 
Cu = 1 — JA , ■ Cij. As we will see, the weights Cij 's will 
play a role in our distributed algorithm. The quantities Cjj, 
j G f \, are assumed available to node i before execution 
of the distributed algorithm. We assume that matrix C has 
strictly positive diagonal entries (each node assigns a non-zero 
weight to itself) and is positive definite, i.e., Ajv(C) > 0. For 
a given arbitrary stochastic, symmetric weight matrix C' with 
positive diagonal elements, positive definiteness may not hold. 
However, such arbitrary C' can be easily adapted to generate 
matrix C that obeys all the required properties (symmetric, 
stochastic, positive diagonal elements), and, in addition, is 
positive definite. Namely, letting, for some k G (0,1), C := 
/ + 1 “^C", we obtain that A at(C) > n. It can be shown 
that, under the above assumptions on C, Ai(C') = 1, and 

A 2 (C) < 1. 

Note that, in the assumed model, matrix C is doubly 
stochastic, i.e., not only its rows but also its columns sum up to 
one. It is worth noting that, for practical implementations, it is 


certainly relevant to consider matrices C which are only row- 
stochastic, and not also column-stochastic. It is an interesting 
future research direction to extend the proposed method to 
row-stochastic matrices also, relying on prior works including, 
e.g- |49). 


B. Proposed distributed algorithm 

We now describe the distributed algorithm to solve (1) 
that we propose. We assume that all nodes are synchronized 
according to a global clock and simultaneously (in parallel) 
perform iterations k = 0,1,... At each iteration /,:, each 
node i updates its solution estimate x\ G X, with arbitrary 
initialization x® G X. To avoid notational clutter, we will 
assume that = x^\ Vi,j. Further, each node has an 

internal Bernoulli state variable z) . If z\ = 1, node i 

updates Xj(fc) at iteration k\ we say that, in this case, node i is 
active at k. If z\ k ^ = 0, node i keeps its current state x, ( k ) and 
does not perform an update; we say that, in this case, node i is 
idle. At each k , each node i generates z\ independently from 
the previous iterations, and independently from other nodes. 
We denote by pk := P (zi(k) = 1). The quantity is our 

algorithm’s tuning parameter, and is common for all nodes. 

We assume that, for all k, pu > p m in, for a positive constant 

Pmin- 

(k) 

Denote by fT the set of working neighbors of node i 

(k) 

at k, i.e., all nodes j G fl; with z) = 1. The update of 
node i is as follows. If z\ = 0, node i is idle and sets 
x| fc+1 ' ) = x^ k \ Otherwise, if z\ k ^ = 1, node i broadcasts its 

state to all its working neighbors j G . The non-working 

• (k) 

(idle) neighbors do not receive x\ , for example, with WSNs, 
this corresponds to switching-off the receiving antenna of a 

(k) r^y(k) 

node. Likewise, node i receives x) from all j E . Upon 

(k) J 

reception, node i updates x\ as follows: 


„(*+!) 


= ?*{ 1 - E ^ 






( 2 ) 


jen 


(k) 


Pk 


jen' 


(k) 


In (2), Vx(y) = argmin^ giV ||t; — y\\ denotes the Euclidean 
projection of point y on X, and a > 0 is a constant; we 
let a < A n(C)/L. (See ahead Remark 2.) In words, (2) 
means that node i makes a convex combination of its own 
estimate with the estimates of its working neighbors, takes a 
step in the negative direction of its local gradient, and projects 
the resulting value onto the constraint set. As we will see, 
multiplying the step-size in (2) by 1/pk compensates for non¬ 
working (idle) nodes over iterations. 


Remark 1 Setting pk = 1, Vfc, corresponds to the standard 
distributed (sub)gradient method in [25]. 

Compact representation. We present (2) in a compact 
form. Denote by x^ := ((x^) T ,..., (x^) T ) T , and z^ := 
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(z[ k \ ..., z^) T . Further, introduce F : M. Nd i-a R, with 

N 

F(x) = F(xx N ) := y^Jijxj). 

i= 1 

Also, denote by X N C R w d the Cartesian product X x ... x X, 
where X is repeated N times. Next, introduce the N x N 
random matrix W^ k \ defined as follows: 

\ Cijzf^zf ] for {i,j} £E,i^j 

WW = < 0 for {i,j} j 

{ !-E sfrWi? for i=j- 

Then, it is easy to see that, for k = 0,1,update rule (2) 
can be written as: 

x (k+i) = VxN | (w(fc) <g ) i) x W (3) 

- y (vF(a: (fe) )©(2 (A:) (g)l))|, 

where W^ k> 0 I denotes the Kronecker product of and 
the dxd identity matrix, 1 in (3) is of size dx 1, and © denotes 
the Hadamard (entry-wise) product. Note that sequence {x^} 
is a sequence of random vectors, due to the randomness of 
the z^’s. The case pk = 1, V/c, corresponds to standard 
distributed (sub)gradient method in [11], in which case (3) 
becomes: 

x (fc+1) = V x n { ( C © / ) x (fe) - a VF(x< fc) )} . (4) 

III. Statement of main results 
W e now present our main results on the proposed distributed 
method (2). For benchmarking of (2), we first present a 
result on the convergence of standard distributed gradient 
algorithm (4). All the results in the current section, together 
with some needed auxiliary results, are proved in Section IV. 
Recall that x * € R d is the solution to (1). 

Theorem 1 Consider standard distributed gradient algo¬ 
rithm (4) with step-size a < Xn(C)/L. Then, x^ converges 
to a point x* = ((x*) T , •••, ( x* N ) T ) T € X N that satisfies, for 
all i = 1,..., N: 

\\x* — rr*|| 2 < ||x* — 1 (8 x *|| 2 < aCy (5) 

_ 4 N{M f - m f ) 2NG 2 

* ' 1-A 2 (C) + /i(l-A 2 (C))' (6) 

Furthermore: 

||x (fc) -x*|| < 2VN D (1 - a p) k = O ((1 - a p) k ) . (7) 

Theorem 1 says that, with algorithm (4), each node’s estimate 

(k) 

x) converges to a point x* in the neighborhood of the true 
solution x*; the distance of the limit x* from x* is controlled 

by step-size a - the smaller the step-size, the closer the limit 

tk) 

to the true solution. Furthermore, x\ converges to a solution 
neighborhood (to x *) at a globally linear rate, equal to 1 — a/i. 
Hence, there is a tradeoff with respect to the choice of a: a 
small a means a higher precision in the limit, but a slower rate 
to reach this precision. Note also that, for a < A n(C)/L, the 
convergence factor (1—a p) does not depend on the underlying 


network, but the distance \\x* — x*|| between arbitrary node 
i’s limit x* and the solution x* depends on the underlying 
network - through the number of nodes N and the second 
largest eigenvalue of matrix C. 

Remark 2 It is possible to extend Theorem 1 to allow also 
for the step-sizes a € (A n(C)/L, (1 +A n(C))/L), in which 
case the convergence factor (1 — a p) in (5) is replaced 
with max {aL — Xn(C), 1 — a p}. We restrict ourselves to 
the case a < Xn(C)/L, both for simplicity and due to the 
fact that step-sizes a - needed to achieve sufficient accuracies 
in practice - are usually much smaller than 1/L. (See also 
Section V.) 

Remark 3 For a < Xn{C)/L , the convergence factor (1 — 
a p) is an exact (tight) worst case convergence factor, in the 
following sense: given an arbitrary network and matrix C, 
and given an arbitrary step-size a < Xn{C)/L , there exists 
a specific choice of functions /,’s, set X, and initial point 
G X N , such that ||aA fc+1 ) — x*|| = (1 — ap)\\x^ — x*||, 
for all k = 0,1, ... 2 

We benchmark the proposed method against the standard 
distributed gradient method by checking: 1) whether it con¬ 
verges to the same point x*\ 2) if so, whether it converges 
linearly; and 3) if the convergence is linear, how the corre¬ 
sponding convergence factor compares with (1 — a p) - the 
convergence factor of the standard distributed gradient method. 

References [20], [21] also analyze the convergence rate of 
the standard distributed gradient method, allowing for step-size 
ranges wider than a £ (0, Xn(C)/L\. They establish bounds 
on quantity \\x^ — 1 © x*|| which are in general different 
than (7), and they are not directly concerned with quantity 
||x( fc ) — x*||, i.e., precise characterization of convergence rate 
of xM"' 1 towards its limit. We adopt here (7) as it gives an exact 
worst-case characterization of the convergence rate towards x* 
for a £ (0, Xn{C)/L\ (see Remark 3). 

We now state our main results on the proposed algo¬ 
rithm (2). The first result deals with a more generic sequence 
of the pus that converge to one; the second result is for the 
Pk s that converge to one geometrically. 


Theorem 2 Consider algorithm (2) with step-size a < 
Xn{C)/L. Further, suppose that pk > p m in, Vfc, for some 
p min > 0, and letpfc —► 1 as k —► oo. Then, with algorithm (2), 
the iterates x^ converge, in the mean square sense, to the 
same point x* as the standard distributed gradient method (4), 
i.e., E [||*« — x*|| 2 ] -A 0 as k -A oo. Assume, in addition, 
that pk = 1 — Uk, with: 


0<u k < 


(fc + l) 1 -*’ 


for some constants C u > 0 and £ > 0. Then, x^ converges 
to x* almost surely. 


2 Consider fi : K —> R, fi(x) = x 2 /2, Vi, X = {x £ R : \x\ < 2}, 
and = 1. Note that, in this case, x* = 0, and p = L = 1. For this 
example, it is easy to show that || 2 A fe + 1 ) — x* || = (1 — a)||x^ fe ) — x* ||, for 
all k = 0,1,.... and so the convergence factor equals 1 — a fi. 
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Remark 4 Note that Theorem 2 assumes that each individ¬ 
ual fi is strongly convex. An interesting future research 
direction is to explore whether Theorem 2 can be extended 
such that one no longer requires that each /, be convex, but 
that the aggregated cost / in ( 1 ) is strongly convex. 

Theorem 3 Consider algorithm (2) with step-size a < 
Ajv(C)/ L. Further, suppose that Pk = 1 — <5 fc+1 , k = 0,1,..., 
for some <5 £ (0,1), and let 77 := max{l — ap, <5 1//2 }. Then, 
in the mean square sense, algorithm ( 2 ) converges to the same 
point 2 * as the standard distributed gradient method (4), and, 
moreover: 

E [|| 2 < fc > - ®*||] = O (kr] k ) =0(( V + e) k ) , 
for arbitrarily small positive e. Furthermore, if y/S < 1 — 0/1: 

E [|| 2 (fc) - 2 *||] =0(k{l~ap) k ) =0((l-ap + e) k ). 

Theorem 2 states that, provided that the pk s are uniformly 
bounded away from zero, from below, and pk 1 , the 
method (4) converges (in the mean square) to the same point as 
the standard distributed method (4). If, moreover, pk converges 
to one at a sublinear rate at least ^ fc+ ^i +c (where £ > 0 
can be arbitrarily small), then the convergence also hold 
almost surely. Therefore, in such scenarios, the random idling 
schedule governed by the pk s does not affect the method’s 
limit. 

Theorem 3 furthermore suggests that, provided that conver¬ 
gence of pk towards unity is linear (geometric) with conver¬ 
gence factor 6 < (1 — ap) 2 , algorithm ( 2 ) converges at the 
practically same rate as the standard method (4), i.e., as if all 
the nodes were working all the time (albeit with a larger hidden 
constant). Hence, we may expect that the proposed method (2) 
achieves the same desired accuracy as (4) with lesser to¬ 
tal cost (smaller number of the overall node activations- 
communications and computations). Still, this does not actu¬ 
ally prove the reduction in total cost with the proposed method 
(due to ignoring hidden constants). However, we observed 
that the savings indeed occur in numerical implementations. 
Section V demonstrates this on logistic losses, but a similar 
behavior is observed on other costs like strongly convex 
quadratic and “source localization” costs (see, e.g., [50] for the 
form of cost functions). The hidden convergence constant in 
Theorem 3 is dependent on the underlying network, sequence 
{ Pk }, and step-size a , and is given explicitly in Remark 6 . 

We now provide an intuitive explanation as to why the 
proposed method ( 2 ) reduces total cost with respect to the 
standard distributed gradient method (4). To this end, intro¬ 
duce x^ x \ k> ~ the (hypothetically available) 

global average of the nodes’ estimates at iteration k , and, 
for the sake of a clearer explanation, assume for a moment 
that problem (1) is unconstrained. Then, it is easy to verify 
that x < - k> evolves according to the following recursion: 

N 

* (fc+ i zf \ k = 0 , 1 ,... ( 8 ) 

Pfctv , =i 

This is actually an inexact version of the (centralized) hybrid 
stochastic-deterministic gradient method in [23]. The “inexact¬ 


ness” here stems from the fact that the term V/,; ( 2 ^) with 
the method in [23] is replaced with V/ t ^ with (2). We 

will show ahead that, for all i, E \\x\ k ^ — 2 ^^ = 0(a), 

and hence the amount of “inexactness” with ( 2 ) is relatively 
small. Therefore, the proposed distributed method (2) behaves 
similarly as the centralized hybrid method in [23]. The authors 
of [23] interpret the method therein as a “hybrid” of a 
stochastic gradient method and a standard gradient method; the 
initial algorithm phase (small pk s) corresponds to a stochastic 
gradient method, and the latter phase (pk s close to one) 
corresponds (nearly) to a standard gradient method. It is well 
known that stochastic gradient methods converge quickly in 
initial iterations but may be slow when near the solution, 
while having cheap iterations; standard gradient methods have 
a steady linear convergence rate and expensive iterations. 
Therefore, it is natural to consider a hybrid of the two - 
stochastic gradient initially and standard gradient at a later 
stage. This is why the method in [23] saves computations 
with respect to the standard (centralized) gradient method. For 
similar reasons, our proposed distributed method ( 2 ) saves cost 
over the standard distributed gradient method (4). 

Choice of pk- We now comment on the choice of pk and 
8. Note from Theorem 3 that 5 = (1 — ap) 2 is the largest, 
i.e., “most aggressive,” value of 8 which guarantees the best 
possible convergence factor (which equals to 1 — ap + e, e > 0 
arbitrarily small). Hence, it is the choice which spends the 
lowest cost while guaranteeing the best possible convergence 
factor, and hence it is optimal in the above sense. For practical 
implementations, when problems under study are very ill- 
conditioned ( with the very large value of L/p), we recom¬ 
mend the following modification: pk = max{l — i5 fc+1 , p }, 
k = 0,1,..., and 5 = min{(l — a p) 2 , 6}. 8 > 0. Quantity p 
prevents using very small values of pk in the initial iterations, 
while 8 prevents too slow increase of pk towards unity for very 
ill-conditioned problems and very small step-sizes. A possible 
choice is p = 0.1, and 5 = 0.99999. 

IV. Intermediate results and proofs 

Subsection IV-A gives intermediate results on the random 
matrices W^ and provides the disagreement estimates - 
how far apart are the estimates x\ of different nodes in 
the network. Subsection IV-B introduces a penalty-like in¬ 
terpretation of algorithm (4) and proves Theorem 1. Finally, 
Subsection IV-C proves our main results. Theorems 2 and 3, 
by applying the penalty-like interpretation on algorithm (3). 
For notational simplicity, this section presents auxiliary results 
and all the proofs for the case d = 1 , but all these extend 
to a generic d > 1. Throughout this Section, all the claims 
(equalities and inequalities) which deal with random quantities 
hold either: 1 ) surely, for any random realization; or 2 ) in 
expectation. It is clear from notation which of the two cases 
is in force. 

A. Matrices and disagreement estimates 

Matrices W^ k \ Recall that J := (1 /N ) 11 . We have the 
following Lemma on the matrices W <kl . Lemma 4 follows 
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from simple arguments and standard results on symmetric, 
stochastic matrices (see, e.g., [51]). Hence, we omit the proof 
for brevity. 

Lemma 4 (Matrices W ( fe l) (a) The sequence {W^} is a se¬ 
quence of independent random matrices. 

(b) For all k, W ^ is symmetric and stochastic (rows sum to 
one and all the entries are nonnegative). 

(c) For all k, 0 -< A I. 

(d) There exists a constant (3 G (0,1) such that, Vfc, 
E [||FF( fc ) - J|| 2 ] < /? 2 . 

It can be shown that /? can be taken as /3 2 = 1 — 
(Pmin)^ [l- (A 2 (C)) 2 ] ; see, e.g., [51]. 

Remark 5 The quantities E [|| — J\\ 2 ] clearly depend 

on k, and, more specifically, on p/ ;: . We adopt here a (possibly 
loose) uniform bound (3 (independent of k) as this suffices to 
establish conclusions about convergence rates of algorithm (3) 
while simplifying the presentation. 

Disagreement estimate. Recall £ (/c) := X EtLi - the 
global average of the nodes’ estimates, and denote by x\ = 
x [ k ■* — x^ k \ Note that both quantities are random. The quantity 
x\ measures how far is the node i's estimate from the global 
average. Denote by x^ := (x^ k \ ..., x^) T . The next Lemma 
shows that E [|||| 2 ] is uniformly bounded, Vfc, and that 
the bound is 0(a 2 ), i.e., the disagreement size is controlled 
by the step-size. (The smaller the step-size, the smaller the 
disagreements are.) The proof follows similar arguments as, 
e.g., [25], and it can be found in the Appendix. 

Lemma 5 (Disagreements bound) For all k, there holds: 



B. Analysis of the standard distributed gradient method 
through a penalty-like reformulation 

We analyze the proposed method (3) through a penalty-like 
interpretation, to our best knowledge first introduced in [52]. 
Introduce an auxiliary function 'R, : R ;V i—>• R, defined by: 
’Mz) : = Eili fi(xi)+^ xT (I~C)x = F(x) + ±x T (I- 
C)x, and the associated optimization problem: 

minimize T' ce (:r) = Eili M x i ) + - c )x /o\ 

subject to x £ X N . 

Function 'R, and (9) will be very useful in the analysis of (2). 
In fact, we will show that (2) is an inexact version of the 
(projected) gradient method on function 'R t: . Clearly, (9) is 
solvable, and it has a unique solution, which we denote by a ;*. 3 

We start by showing that standard distributed (sub)gradient 
method in [11] is an exact (projected) gradient method on 'R t . 
Indeed, the derivative V'k a (;r) = VF(x) + — C)x. The 

'The point of convergence of algorithm (4) and the solution to (9) are 
intentionally denoted by the same symbol because - as we will show - they 
actually are the same point. 


projected gradient method on 'R, with step-size a then takes 
the form: 

a;( fe+1 ) = V x n {a: (fe) -aV$ a (i w )} (10) 

= V X N^x {k) - 

a ^VF(x (t >) + i(J - C)x w ^j | , 

which, after rearranging terms, is precisely (4). 

It is easy to see that \I/ Q is strongly convex on R^, 
with modulus p! = p (which equals the strong convexity 
modulus of the /i’s). Further, V$ a is Lipschitz continu¬ 
ous on R N , with constant L 1 = L + Namely, 

\/x,y G R N : ||Vtf a (z) - Vtt a (y)|| < \\VF(x) - VFfo)|| 

+ ^ll^-C'll \\ x ~y\\ < L \\ x ~y\\ + ( Note 

that || VT(xj- \7F(y)\\ < L||a:— y\\ follows after summing the 
inequalities: |V/i(x*)- Vfi(yi)\ 2 < L 2 \x i -y 1 \ 2 , i = 1,..., N, 
and using ||VF(a:)|| 2 = Ei=i |V/i(xi)| 2 .) We impose that a 
satisfies a < jj, which, after simple manipulations, gives: 
a < Aa t(C)/(L), as introduced before. 

An immediate consequence of the fact that algorithm (4) 
is precisely the projected gradient method to solve (9) is the 
following Lemma, first observed in [52]. 

Lemma 6 ([52]) Standard distributed gradient algorithm (4) 
with step-size a < An(C)/L converges to the point x* G X N 
- the solution to (9). 

We proceed by proving Theorem 1. 

Proof of Theorem 1: As per Lemma 6, algorithm (4) 
converges to x* - the solution to (9). We hence need to prove 
for the solution to (9) the characterization in (5). 

Consider an arbitrary point x G X N , and let x := 
X Y^iLi x i■ We first prove the following inequality: 

aNG 2 

m H x *) < (*«(*) - *«oo) + 2(1 _ A2(C)) . (id 

Indeed, we have that: 

x T (I — C)x = (x — x;l) T (/ — C)(x — xl) 

> Ajv-i(/ — C)||a; — xl\\ 2 

= (1 — A 2 (C))||5;|| 2 , 

where we let x := x — xl. Further, 

N 

H fi( x ) + ( ^2(fi( x i) - M x )) ) 

i=l i—1 i—1 

N 

> f(x)-Gj2\xi~ x \ 

i=1 

> /(*)-GV^||®||. 

The second from last inequality follows because fifxf) > 
fi(x) + S7fi(x)(xi -x)> fi(x) -G\xi-x\. Combining the 
previous conclusions: 

*a(aO-*a(**) > /(*) - $«(**) - GVN\\X\\ 

+ ±(l-X 2 (C))\\x\\ 2 

2 a 
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> f(x) - ' $ a (x *) 

- sup \GVNt-±(l-\ 2 (C))A 

t>o [2a J 

> /(*)-*„(*') - 2(1 “_*f (c)) - (12) 

Next, note that \I/ Q (a;*) = min^g^jv 5' Q (a;) < v E' a (a;*l) = 
f(x*), and so — ^(a;*) > —f(x*). Applying this to (12), 
completes the proof of (11). 

We now prove claim (5) in Theorem 1. We have: 

||x* — cc* 1II 2 = \\x' ~x'l + x'l-x*l\\ 2 

< 2 ||x* — x*l|| 2 + 27V \x* — x*| 2 . (13) 

For the second summand in (13), we have: 

(/(*•)-/(**)) 

aNG 2 

“ Ny(l-X 2 (C))' 

The first inequality above is due to strong convexity of / (with 
modulus N y), and the second applies (11) with x - x* (where 
x* = Y2iLi x i')- 'A' 6 now upper bound the first summand 
in (13). We have that: 

N 1 

*«CO = Y,Mx-) + -(x-) T ( i -C)x- 

> 1 ~^ {C) W X '\\ 2 + Nm f , 

where x* = x* — x* 1. On the other hand. 


*a(x m )<f(x*)<NM f . 

Combining the obtained upper and lower bounds on T 0 (.x'*), 
we obtain for the first summand in (13): 

Combining the bounds on the first and second summands, the 
claim in (5) follows. 

It remains to prove the claim in (7). By standard analysis 
of gradient methods, we have that: 

||x (fe) - s*|| < (1 - ay) k ||x (0) - x*|| < (1 - ay) k 2s/ND, 

where we used that ||x^°' ) || < y/ND, and the same bound for 
x*. Thus, the desired result. ■ 


+ 


1 

a 


E ( 4 ‘># ) 



. (15) 


Hence, (2) is an inexact projected gradient method applied 
to T (l , with step-size a, where the amount of inexactness is 
given by vector e^ k \ 

Overall, our strategy in analyzing (14) consists of two 
main steps: 1) analyzing the inexact projected gradient method 
(14); and 2) characterizing (upper bounding) the inexactness 
vector e <k K For the former step, we apply Proposition 3 
in [53], Adapted to our setting, the proposition says the 
following. Consider minimization of tf>(y) over y £ y, where 
(j> : -> R is a convex function, and y C R m is a 

closed convex set. Let y * be the solution to the above problem. 
Further, let <p be strongly convex with modulus yp, > 0, and let 
(f> have a Lipschitz continuous gradient with constant Lp, > yp,. 


Lemma 7 (Proposition 3, [53]) Consider the algorithm: 
1 


y 


(fc+i) _ 


= Vy \ - 


V0(j/ (fc) ) + 


, k = 0 , 1 , 


(k) 

where e v is a random vector. Then, Vfc = 1, 2,...: 

Ily (fe) - V*\\ < (l-yJLt) k \\yW-y-\\ 


K 


t=i 


where £ y is the initial point. 


Note that, if VL is Lipschitz continuous with constant L ( p, 
then V(f) is also Lipschitz continuous with constant 1/a > Lp,. 
Therefore, for the function </> and the iterations: 


y (fe+1) = Vy {y^ - a 


Vf(y^) + ef 


} 


k = 0,1,..., 


there holds: 


||y (fc) — y*|| < (l-a^) fc |||/ (0) -i/ # || 

k 

+ a^(l - a/r 0 ) fe_ *||e^ _1) ||, k= 1, .(17) 

t=i 

In other words, the modified claim (17) holds even if we take 
a step size different (smaller than) 1/ L 0 . 

For analyzing the inexact projected gradient method (14), 
we will also make use of the following result. Claim (17) is 
Lemma 3.1 in in [12]. Claim (18) can be proved by following 
similar arguments as in equations (44)-(48) in [18]. 


C. Analysis of the proposed method (2) 


We now turn our attention to the proposed method (3). It is 
easy to verify that (3) can be written as: 

x (k+i) = VxN j^fc) _ a VT' a (x( fc) ) + e( fc) ]}, (14) 

where e ^ = (e[ k \ is a random vector, with z-th 

component equal to: 


„(*0 


,(k) 


«• - 


Lemma 8 Consider a deterministic sequence such 

that Vk —> 0 as k —> oo, and let a be a constant in (0,1). 
Then, there holds: 

k 

1->0. (18) 

t=i 

If, moreover, there exist positive constants C v and ( such that, 
for all k = 0,1,..., 

°~ Vk ~ (k + l)i+C’ 
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then there exists positive constant C' v such that, for all k = 

1 , 2 ,..., 

k 

S tStc 09) 

t= 1 


holds: 


E 


0 (fc )||2 


< 4(1 - p fe ) 


IV G 2 


Pin in 


72(1 -vl) 


NG 2 


(fmin) 2 (l - /?) 2 


where 


C e = 


— Ce (1 Pk)i 
ANG 2 721VG 2 


Pmin (p min ) 2 (l - /3) 2 ' 


Proof: Consider (15). We have: 


Ie5 fc) l 2 < 2 


Jfc) 


- 1 


Pk 


+ Ji E c a \ z i k)z j k) - 1 

ffc) 

< 2G 2 | —-1 

Pk 


|V /<(*| fc) )| 2 

(k) (k) i 12 


(fe) (fe) 

~ %j 


+ % E ^ 14 S- -1 


left; 


x(fe) 


+ 




Life) (fc)l 

2 

’ '<>5 

S* 

1 

= 


< 2 




+ 2 


_ *(*> 


Taking expectation, and using independence of x^ from z^: 





' Jk) 

2" 

E 

|e| fc) | 2 ' 

< 2G 2 E 

*--l 





Pk 



~2 £ c « E 


j'efk 


i 4 ‘M‘> - 9 2 


x E 


Step 1: gradient inexactness. We proceed by charac¬ 
terizing the gradient inexactness; Lemma 9 upper bounds 
quantity E [|||| 2 ]. 


~(fc) 


We proceed by upper bounding E 


-E 


A*) 


A k ) 


- 1 


(24) 


, using the 


total probability law with respect to the following partition: 

{z\ k) = 1}, and {z\ k) = 0}: 


( 4 fc) = i) + p( 4 fe) = o) 

(25) 


there 

(fe) 

Z- 

2“ 


1 

E 

-1 


= 

- 1 


Pk 



Pk 


p k + (1 - Pk) 


( 20 ) 


( 21 ) 


( 22 ) 


1 

=-1 

Pk 

= — (1 -Pk) 2 + (1 -Pk) 

Pk 

< — (1 -p k ) + (1 -Pk) 

Pk 

5 2(1 Pk) /Pmin- 


(26) 


We next upper bound E 


\ z ^k) z (k) _ i |2 ( us i n g the total 
(k) (k) 

probability law with respect to the event {z\ = 1, Zj — 1} 

and its complement; we obtain: 


E 


i 4 fe) # } -ii 2 


= (1-P(4 fe) = 1, z) K ’ = 1)) 


(fe) 


= (1-Pfe) 2 . 

Substituting (26) and (27) in (24): 


(27) 


E 


Ak) |2 


< 4G 2 (1 — Pk)/Pn 

4 

+ a 


'E'Cvii-pb 


(23) 


fen* 


Inequality (22) uses the following bound: (u + v) 2 < 2u 2 + 
2v 2 . It also uses, with m := (z^z^ — ^){x[ k ^ — £^)> the 
following relation: 

i>2 G,,u,S 2 = ' G„-0i 2 

jesL 

< ^ ] CijU 2 + Cu ■ 0 2 

je Qi 

= ^2 CijU 2 , 

jeo, 

which follows due to the fact that Yhj^o. CijUj + Cu ■ 0 
is a convex combination, and v H>• u 2 , v £ R, is convex. 
Inequality (23) uses that 

\x (k) -x^+x^-x[ k) \ 2 


x E 


~(fc) 

x) 


+ E 


Ak) 


(28) 


Summing the above inequalities over i = 1,..., N, using the 


fact that £ ien . Ctj < 1, Mi, E [||eW|| 2 ] = E 
andE[pW|| 2 ] =EE E 


„{k) | 2 


fk) 12 


, we obtain: 


E 


=( fc )|l 2 


< 4 IV G 2 (1 — Pk)/Pn 


+ — 5 - (1 — Pfc)E 


?(fc) 


Finally, applying Lemma 5 to the last inequality, the claim 
follows. ■ 

Step 2: Analyzing the inexact projected gradient method. 

We first state and prove the following Lemma on algorithm (2). 

Lemma 10 Consider algorithm (2) with step-size a < 
\n(C)/(L). Then, for the iterates x^ and x*-the solution 
to (9), Mk = 1,2,..., there holds: 


E 


\x (k) -x*\\ 2 


< 8N(l-ap) 2k D 2 
+ X^(i — a p) kt {^ ~ Pt-i)- 
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Proof: As already established, algorithm (2) is an inexact 
projected gradient method to solve (9), with the inexactness 
vector e^ k \ We now apply (17) to sequence x^ and itera¬ 
tions (3); we obtain: 


||*( fc) -*l < (1 — a^) fc ||a; (0) — x*|| 

k 

+ a^(l-a / r) fc - t ||e( t - 1 )||. (29) 

t-i 

Squaring the latter inequality, using (u + v ) 2 < 2 u 2 + 2v 2 , 
and ||ar^ 0 ^ — x*\\ < 2 VND: 

\\x {k) -x*\\ 2 < 8(1 — a p) 2k ND 2 

+ a 2 _ a P) k 

k 

x ^(l-a M ) fc - t ||e( t - 1 )|| 2 . (30) 

t= 1 

In (30), we used the following. Let 0 t = (1 — and 

S t ■= Et-, °t- Then, 

< s?Elii« < - 1, ii 2 

k 

= S t ^ 0 t ||e (t_1) || 2 , 

t= 1 

where we used convexity of the scalar quadratic function v > 
v 2 . Now, using Et=i(! - ) k ~* < i_ ( i- a>t ) = ^ (30) is 

further upper bounded as: 

||x (fc) -x*|| 2 < 8(1 — a p) 2k ND 2 

+ ^^(l-a M ) fe - t ||e( t - 1 )|| 2 . 


Taking expectation, and applying Lemma 9, we obtain the 
claimed result. ■ 

We are now ready to prove Theorems 2 and 3. 

Proof of Theorem 2: The proof of the mean square sense 
convergence claim follows from Lemma 10 by applying (18). 
Namely, setting a := 1 — ap and v t := 1 — p 2 , the desired 
result follows. 

We now prove the almost sure convergence claim. By 
Lemma 10, using Pk = 1 — Wfc, we have: 


E 


x {k) - x* 


< 8N(l-ap) 2k D 2 


+ 


2a C e 


^2(1-ap) 


k—t 


U t - 1. 


t=l 


Now, from (19), there exists a positive constant C' u such that, 
for all k = 1, 2,...: 


k 

E 

t —i 


(1 - a/i) te 4 M t _i < 


C' 

'-'u 

£4+C ’ 


and hence: 

8N(l-ap) 2k D 2 
2aC e C' u 1 
M F+C' 

Summing the above inequality over k = 1,2,..., we obtain 
that: 

OO 

E E 

k =1 

Applying the Chebyshev’s inequality and (31), we conclude 
that: 

OO 

P — at®|| > < oo, 

fc=l 

for any e > 0. Therefore, by the first Borel-Cantelli lemma, 
P (||a;( fc ( — as® || > e, infinitely often) = 0, which finally im¬ 
plies that x converges to x *, almost surely. 


r (fe) 


— £ 


< oo. 


(31) 


E 


\x {k) -x' 


< 


Proof of Theorem 3: Consider (29). Taking expectation: 

< VN{1~ ap) k 2D (32) 

k 

+ a£(l - apf-^Ceil - p 2 _ i) 

t=l 

< \/]V(l - a /r) fc 2D (33) 

k 

+ ay^(l — ap) k ~ l \/C^V2(Vsy. 

t =1 

The first inequality uses E[|it|] < (EljttI 2 ]) 1 / 2 . The second 
inequality uses 1 — p 2 _ 1 = (1 — (1 — S t )) 2 < 2(5*. Consider 
the sum in (33). For each t, each summand is upper bounded 
by r) k y/2C e , and so the sum is 0(kr/ k ). The term y/N( 1 — 
a p) k 2D = 0(i 7 fc ). Hence, the overall right-hand-side in (33) 
is 0(krj k ) = 0((r] + e) k ), which completes the proof. ■ 


E 


k (fe) - x* 


Remark 6 The proof of Theorem 3 also determines the con¬ 
stant in the convergence rate. From the above proof, substi¬ 
tuting the expression for C e in (21), it is straightforward to 
observe that, for all k = 1,2,...: 



1 


< 12 max < 


Vnd , 


aVNG 

_Pmin(l /5) 


kr) k . 


V. Simulations 

We provide simulations on the problem of learning a linear 
classifier via logistic loss, both on synthetic and real data 
sets. Simulations demonstrate that our proposed idling strat¬ 
egy significantly reduces the total cost (both communication 
and computational costs), when compared with standard dis¬ 
tributed gradient method where all nodes work at all iterations. 
Simulations also demonstrate the method’s high degree of 
robustness to asynchrony and its benefits over gossip-based 
strategies for solving (1). 

We consider distributed learning of a linear classifier via 
logistic loss, e.g., [54], Each node i possesses J data samples 
{ a ijn H ere - a ij € P d_1 , d > 2, is a feature vector, 

and bij £ {—1,+1} is its class label. We want to learn a 
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vector x = (xJ,Xo) T , X\ € M d_1 , and Xq € R, d > 1, 

such that the corresponding linear classifier sign = 

sign (xj a + Xq ) minimizes the total surrogate loss with l 2 
regularization: 

(34) 

subject to a prior knowledge that 11*11 < M, where A4 > 0 is a 
constant. Here, j7i 0 gi K (-) is the logistic loss j7i og is(a) = log(l+ 
e _ “), and R is a positive regularization parameter. Clearly, 
problem (34) fits the generic framework in (1) with fi{x) = 
Ej=iJ\ogi s (hjH x (a i:j )) + R\\x\\ 2 /2, /(:r) = £jli/»(*)> 
and X = {x £ M 4 : ||x|| < A4}. A strong convexity constant 
of the fi s p can be taken as p = R, while a Lipschitz constant 
L can be taken as ^ || J^iLi £j=i c ij c Jj II + where Qj = 
(bij a ij jbij) ■ 

With all experiments, we test the algorithms on a connected 
network with N = 50 nodes and 214 links, generated as a ran¬ 
dom geometric graph: we place nodes randomly (uniformly) 
on a unit square, and the node pairs whose distance is less 
than a radius are connected by an edge. 

All our simulations are implemented in Matlab on a single 
standard personal computer. Hence, the network itself is also 
emulated over a single computer. An interesting future step is 
to run the proposed algorithm over a distributed environment 
(e.g., a cluster with 50-100 cores) where actual operation 
imperfections may reveal further real implementation effects. 

Experiments on synthetic data. In the first set of experi¬ 
ments, we generate data and set the algorithm parameters as 
follows. Each node i has J = 2 data points whose dimension 
is d — 1 = 3. We generate the cq/s independently over i 
and j; each entry of a t j is drawn independently from the 
standard normal distribution. We generate the “true” vector 
x* = ((a^) T ,*o) T by drawing its entries independently 
from standard normal distribution. Then, the class labels are 
generated as by = sign ((xj;) T ajj + Xq + e^), where e^-’s 
are drawn independently from normal distribution with zero 
mean and standard deviation 0.1. The obtained correspond¬ 
ing strong convexity parameter p = 0.1, and the Lipschitz 
constant L « 0.69. Further, we set A4 = 100 and R = 0.1. 

With both algorithms, we initialize at*(0) to Vx{hi), where 
the hi s, i = 1..... A r , are generated mutually independently, 
and the entries of each h t are generated mutually indepen¬ 
dently from the uniform distribution on [—50, +50]. We utilize 
the Metropolis weights, e.g., [55]. With the proposed method, 
we set Pk = 1 — <5 fc+1 , k = 0,1,..., and <5 = (1 — a p) 2 . 

As an error metric, we use the relative error in the objective 
function averaged across nodes: 


N ( J i 

'y ' I y ' iTiogis ( bijH x (aij )) + — 7^||x|| 
i =1 \j=l 


if/(4Vr 

N ^ f * 

*=l J 


/*> o, 


where /* is evaluated numerically via the (centralized) pro¬ 
jected gradient method. With the proposed method, we run 
100 simulations and consider both the average (empirical 
mean) relative error (averaged across 100 simulation runs with 
different instantiations of node activations, i.e., variables z(k)) 


and the relative error’s histograms. 

We compare the two methods with respect to the total cost 
(which equals the total number of activations across all nodes), 
where a unit cost corresponds to a single node activation at 
one iteration; we also include the comparisons with respect to 
the number of iterations k. We consider two different values 
of step-sizes, a € { 25 q l > 5 (jx}- which correspond to different 
achievable accuracies by both methods. 

Figure 1 compares the proposed and standard distributed 
gradient methods for a = 1/(50 L). We can see that the 
proposed method significantly reduces total cost to reach a 
certain accuracy, while at the same time it does not induce 
an overhead in the total number of iterations. For example, 
consider Figure 1 (a) which plots the relative error averaged 
over 100 simulation runs versus total cost. We can see that, 
to achieve relative error e = 0.01, the proposed method has 
on average the total cost around 16, 700, while the standard 
distributed gradient method requires around 25,500, which 
gives the relative savings of about 33%. At the same time, the 
two methods require practically the same number of iterations 
(Figure 1 (b)). Figure 1 (c) shows the histogram for the 
proposed method of the total cost to achieve relative error 
e = 0.01, where an arrow indicates the cost of the standard dis¬ 
tributed gradient method (which equals 25, 500). Figure 1 fd) 
repeats the study for the total number of iterations. Figure 2 
shows the comparisons for a = 1/(250 L), and it shows 
histograms to reach e = 0.005, again demonstrating very sig¬ 
nificant improvements. To reach relative error e = 0.005, the 
proposed method requires the total cost around 90, 000, while 
the standard distributed gradient method takes 137,000, which 
corresponds to the proposed method’s savings of about 44%. 

Experiments on a real world data set. In the second set of 
experiments, we consider the same network with 50 nodes and 
test the algorithms on data set “ala” which we downloaded 
from the repository: http://www.csie.ntu.edu.tw/ cjlin/libsvm/. 
There are N J = 1,600 data points 4 (J = 32 per node) 
of dimension d — 1 = 119 (optimization variable dimension 
is 120). We set M. = 100 and 1Z = 0.1, p = R, and 
L = l maxj = i. jv || £/ =1 c io cj,|| + R. 

We use all the system and algorithmic parameters the same 
as in the first set of experiments, except the following. With 
both methods, we initialize 2 +'* to zero, for all i. With 
the proposed method, we set pk = maxjl — 0.1}, 

k = 0,1,..., and S = min{(l — ap) 2 , 0.99999}. (See the 
discussion in the last paragraph of Section III.) As the error 
metric, we use the average cost function (averaged across 
nodes) 

i N 

i=l 

With the proposed method, we run one sample path realization. 

Figure 3, a) and b), compares the proposed and standard 
distributed gradient methods for the “ala” data set and step 
size a = 1/(50 L). We can that the proposed method reduces 
the total cost at least 3 times, while incurring a very low 

4 There are actually 1,605 points hut we discarded 5 of them to evenly 
distributed data points across nodes. 
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overhead in the total number of iterations. 


Modeling and testing asynchronous operation. In appli¬ 
cations like, e.g., WSNs, accounting for asynchrony in the 
algorithm’s operation is highly relevant. In such scenarios, 
when node i decides to activate at iteration k and transmit 
a message to node j, this message may be lost, due to, e.g., 
packet dropouts in WSNs. In addition, an active node may fail 
to calculate the local gradient at iteration fc, because the actual 
calculation may take longer than the time slot allocated to 
iteration k, or due to unavailability of sufficient computational 
resources at k. Therefore, under asynchrony, the schedule of 
the realized inter-neighbor communications and local gradient 
evaluations is not under the full control of networked nodes. 


We introduce the following model. At each link {i,j} £ E 
and each k = 0,1,let z) t ^ be a binary random variable 
which takes value one if the communication link is online and 
zero otherwise; let := P = 1^. Therefore, variable 

models a failure of link {i,j} at k. Similarly, for each 

’ . . ^(k) 

node i, introduce a binary random variable z) , which takes 

the value one if the calculation of V/,; * s successful 

and zero otherwise. We let pi := P ^zj k) = 1^. Variable zip' 1 

hence models failure of node i’s gradient calculation at 

(k) 

(As before, each node i activates if z) = 1 and stays idle if 
(k) . k) 

z\ = 0.) We assume that the variables z/ ( are independent 

. . . ; . k) 

both across links and across iterations; likewise, the z\ s 

are independent both across nodes and across iterations; and 

that the node activations, the link failures, and the gradient 

calculation failures are mutually independent processes. Note 

that z is in control of node i, while the zPP’s and zj k ^’s 

are governed “by nature.” The update of node i and iteration k 

is as follows. If z; k) = 0, node i stays idle; else, if = 1, 

we have: 


„(fc+i) 


Vx{ (i- £ 


Z 3 


Cii )x\ 


(k) 


(35) 


A k )z( k ) „{k) 


'V Z i Z {i,j} X J 


+ ^2 Ci - 
jetoi 

- (*<*>)}. 
Pk 


Note that we assume that nodes do not have prior knowledge 
on the asynchrony parameters p,’s and Puj\’ s. 

We provide a simulation example on the synthetic data set 
and the 50-node network considered before with the same 
simulation parameters and a = 1/(50L). Each Pu.j} is 
set to 0.5, while for the p)’s we consider two scenarios: 
1) lower failure probabilities, where one half of the nodes 
has pi = 0.9 and the other half has % = 0.5; and 2) 
higher failure probabilities, where a half of the nodes has 
Pi = 0.9 and the other half has % = 0.1. Note that the latter 
scenario corresponds to rather severe conditions, as one half 
of the nodes successfully computes the gradients only with 0.1 
probability. 

Figure 3 (c) shows the performance of the proposed method 
for three scenarios: no failures, lower failure probabilities, 
and higher failure probabilities. It shows the empirical mean 


of the relative error averaged across 100 iterations (higher 
failure probabilities are shown with a dashed line, and lower 
failure probabilities with a dotted line.) We can see that 
the proposed algorithm exhibits a very strong resilience to 
asynchrony. First consider the higher failure probabilities 
scenario (dashed line in Figure 3 (c)). We can see that, despite 
the severe conditions, the proposed algorithm still converges 
close to the solution, naturally with a decreased convergence 
rate and with a moderately increased limiting error. Now, 
consider the lower failure probabilities scenario (dotted line in 
Figure 3 (c)). The proposed algorithm again generally slows 
down convergence, as it is expected. However, interestingly, it 
actually achieves a higher degree of accuracy asymptotically 
than under the synchronous scenario. This is explained as 
follows. The effective step-size of node i with algorithm (35) 
equals // zp^zj fe \ which is on average api. Hence, in a sense. 
Pi has the effect of decreasing the step-size. The step-size 
decrease has the known effect of slowing down convergence 
rate but improving the asymptotic accuracy, as confirmed in 
Figure 3 (c). The improved asymptotic accuracy indeed occurs 
as long as the pi’s are not mutually too different. When the p,’s 
are mutually too far apart, different nodes effectively use very 
different step-sizes (which equal to api), and this disbalance 
makes a negative effect on both the convergence speed and on 
the asymptotic accuracy - as confirmed in Figure 3 (c) for the 
higher failure probabilities case. 

Comparison with a gossip-based scheme. To further 
corroborate the benefits of the proposed idling scheme with 
increasing activation probabilities, we compare it - on the 
synthetic data set and a = 1/(50 L) - with the gossip- 
based scheme in [13]. Figure 3 (d) plots the relative error, 
averaged over 20 simulation runs, versus iteration number. 
(Gossip is shown in a dash-dot line.) With both methods, 
we use the same step-size parameter a. We can see that the 
proposed scheme outperforms gossip. Most notably, the gossip 
scheme has a larger steady state error. We explain why this 
happens. Namely, with gossip, only two nodes (out of N) are 
active at all iterations. This means that, essentially, the gossip- 
based scheme behaves as an incremental gradient method 
(more precisely, a mini-batch) gradient method, where the full 
gradient (which equals the sum of N local nodes functions’ 
gradients) is at all times approximated with the sum of two 
local gradients. Therefore, the gossip-based scheme incurs an 
increased steady state error, for a similar reason as the fact why 
the (centralized) incremental gradient method with a constant 
step size does not converge to the exact solution. In contrast, 
our method essentially behaves as a full gradient method, thus 
leading to a higher accuracy. 


VI. Discussion and extensions 

Subsection VI-A extends our results to the case of convex 
costs which do not have to be differentiable nor strongly 
convex, while Subsection VI-B provides a quantification of 
the gains in the total cost for a special case of quadratic costs 
with identity Hessians. Proofs of the results in the current 
section can be found in the Appendix. 
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A. Relaxing strong convexity and differentiability 

We assumed throughout the previous part of the paper that 
the fi s are strongly convex and have Lipschitz continuous 
gradients. We now extend our results to more generic cost 
functions, when these two assumptions are relaxed. Specifi¬ 
cally, we now let each f, : R. d —> K. be convex and Lipschitz 
over set X , i.e., for all i, there holds: 


\fi(x) - fi(y)\ < G\\x-y\\, \/x,y£X, (36) 


for a nonnegative constant G. We continue to assume that 
X is convex and compact, so (36) is satisfied for any convex 
function, e.g., [56]. Optimization problem (1) is solvable under 
this setting. 

The proposed algorithm (2) generalizes straightforwardly: at 
node i and iteration k, gradient V/> j is replaced with 
an arbitrary subgradient from the subdifferential set of /, at 
x[ k \ dfi [x^^. We note that Lemmas 5 and 9 continue to 
hold here as well. 

Before presenting our result on the modified algorithm (2), 
we recall that the standard distributed gradient method 
achieves for the setting assumed here the following perfor¬ 
mance. Define, for each node i, the running average: 


(fc) 

x) ' — — 


1 k— 1 

1 V x (t) 

1 ’ 

t=0 


k = 1,2, 


Then, for all i (see, e.g., [25]): 



+ O (a ). 


(37) 


in Note that p = L = 1 and x* = bi. Note that 

this is a rather simple problem, where the optimal solution 
b* can be obtained if each node solves its own optimization 
problem, and then the results are averaged, e.g., through a 
consensus algorithm. However, it is very useful to illustrate 
in a clear way the cost savings of the proposed method. 
Denote by b* := 1 ® x* and b := (bj, ...,bff) T £ R Nd . 
For simplicity, we consider equal weights C i: j = Co, for all 
{i,j} £ E , so that weight matrix C = I — Co C, where C is 
the (un-normalized) zero-one graph Laplacian matrix. Denote 
by A i{C) the i-th smallest eigenvalue of C, i = 1, ...,7V (as is 
common with the Laplacians). Then, for Co < A.y (£), we 
have || C — J|| = 1 — CoA 2 (£). From now on, we write 
simply A; = A,(£). Denote by R sp := || [ (I — J) (g> I] 6||, 
and by Rq := ||aA°) — 6*||. Quantity R sp measures how spread 
are the bi’ s, i.e., how the bi ’s (minimizers of the individual 
fi’ s) are far apart from solution x* = jj Y^fLi bi- With the 
proposed method, we set pk = 1 — \5 k+1 , k = 0 , 1 ,..., 
with 8 = 1 — a 6, 8 £ (0,1/a]. (This is a slightly different 
choice from one considered in the rest of the paper.) We 
consider as an error metric the norm of the mean distance 
to the solution: ||E [aA fc )] — 6*||. (The expectation here has 
no significance for the standard distributed gradient method 
as it is deterministic.) This is not a very strong metric, but 
nonetheless it allows to derive neat expressions. We denote the 
latter quantity with the standard distributed gradient method 
by ff k%> and with the proposed method by ■ 

Intermediate results. We derive the following upper 
bounds on and x (k \ respectively. For all k = 3,4,..., 
there holds: 


For method (2), we show the following. Assume that activation 
probability pk = 1 — «fc, rtfc > 0, Vfc, satisfies that: 


S u ■= s/uk < oo. (38) 

k =0 

Then, for all i, for all k = 1,2,...: 


E 


frt k J) - r 


AND 2 2V2VN DffCfSu 


- aG% T 2 ctC e -f- 


< 

2 ak 
aNG 2 


+ 


k 

3 aNG 2 


2(1-A 2 (C)) Pmin ( 1-/3) 


(39) 


where G\ := 2 NG 2 + 18 w[l | g - ) 2 ■ Therefore, as long as pk 

converges to one sufficiently fast (per condition (38) it suffices 
to have, e.g., pk = 1 — ( fc+1 1 ) 2+<; , C > 0 arbitrarily small), the 
idling schedule does not violate the O (a + bound. 


B. Quantifying reduction in total cost 

Although Theorem 3 demonstrates that the proposed method 
achieves practically the same convergence factor (in terms of 
iterations k) as the standard distributed gradient method, the 
Theorem does not explicitly quantify the cost reduction needed 
for achieving a prescribed e-accuracy. Quantifying this in full 
generality is very challenging. We pursue here the special case 
of quadratic costs with identity Hessians. 

Setting. We let TV > 2, and let fi : —> R be fi(x) = 

7}\\x — bi\\ 2 , i = 1, ...,7V, where the bf s are constant vectors 


ffk) 

< 

ff k ) 

Sub 

:= (1 — a) k Ro + a R sp (N - 

-1) 

(40) 


v 

1 - 

(1 - a - c 0 A 2 ) fc 




A 


C 0 A 2 + (X 



x (fc) 

< 

(fc) 

Xub 

:= (1 — a) k Rq + a R sp (N - 

-1) 

(41) 


X 

( 1 

V c 0 A 2 (1 — S k / 2 ) + a 




- 1 - 

(ffz 

- a - c 0 A 2 (l - S)) (k 1)/2 




c 0 A 2 (l-i5)+a 

Results. Based on the above inequalities, we derive the 
following result. Let the desired accuracy be e, i.e., we want 
that: Cub < e and X ub < Then - f °r a = 2 (^°-i)fl, P and 
9 > —k~, after: K e = -^(N-D 2 \ n tABs.\ iterations, we have 

that Cub = e ( 1 + °( £ ))> and Xub = e(l + o(e)), i.e., both 
algorithms achieve the same error e after the same number of 
iterations K e (up to lower orders in e). Therefore, the proposed 
method achieves savings in total cost (per node) equal to: 
K ? - £f=o Pk, which is approximately ^ j- 

It is worth noting that, while these worst-case savings on the 
special quadratic costs with identity Hessians may not be very 
large, simulations on the more generic costs (with non-unity 
condition numbers Iff fi) and real world data demonstrate large 
savings (as presented in Section V). 

VII. Conclusion 

We explored the effect of two sources of redundancy with 
distributed projected gradient algorithms. The first redundancy. 
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well-known in the literature on distributed multi-agent op¬ 
timization, stems from the fact that not all inter-neighbor 
links need to be utilized at all iterations for the algorithm 
to converge. The second redundancy, explored before only in 
centralized optimization, arises when we minimize the sum 
of cost functions, each summand corresponding to a distinct 
data sample. In this setting, it is known that performing 
a gradient method with an appropriately increasing sample 
size can exhibit convergence properties that essentially match 
the properties of a standard gradient method, where the 
full sample size is utilized at all times. We simultaneously 
explored the two sources of redundancy for the first time 
to develop a novel distributed gradient method. With the 
proposed method, each node, at each iteration k, is active with 
a certain probability pk, and is idle with probability 1 — pi~, 
where the activation schedule is independent across nodes 
and across iterations. Assuming that the nodes’ local costs 
are strongly convex and have Lipschitz continuous gradients, 
we showed that the proposed method essentially matches the 
linear convergence rate (towards a solution neighborhood) of 
the standard distributed projected gradient method, where all 
nodes are active at all iterations. Simulations on 1,2 -regularized 
logistic losses, both on real world and synthetic data sets, 
demonstrate that the proposed method significantly reduces 
the total communication and computational cost to achieve a 
desired accuracy, when compared with the standard distributed 
gradient method, and it exhibits strong resilience to the effects 
of asynchrony. 


Appendix 


A. Proof of Lemma 5 

Proof: Consider (2), and denote by: 

(k) 

y^:= E - ^-VfM k) )- 

jefiiUfi} k 


Also, let e ^ = (e G \ ...,e^) T , where := -p x jjE j — 
(k) 

y\ . Then, (3) can be written in the following equivalent form: 

a^+i) = IU (fe) x (k) - — ( VF(x (fe) ) © z {k) ) + e (fc) . (42) 

Vi- V / 


We first upper bound ||e^ fe ^||. Consider e\ k \ We have: 
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(43) 


(k) x? ) _ 2£i_ v/ . (a .W> 


Pk 
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E 

I jeOiU{i} 


jk) 


Vfi{x[ k) ) 

Pk 


y k \ v/>(4 fc) )l < 


(44) 

(45) 


Equality (43) holds because 


« n ,u(.) as 

(k) 

a convex combination of the that belong to X by con¬ 
struction, and due to convexity of X. Inequality (44) is by the 
triangle inequality. Finally, (45) is by the non-expansiveness 
property of the Euclidean projection: \Vx{u] — Vx{v}\ < 
|it — i>|, Vm, rel. Therefore, we obtain the following bound 


c(k)| 


f (k)\ 


< 


2 \/NaG 


Pmh 


(46) 


We now return to (42). Note that xJ k ^ = (J — J)x^ k \ Also, 
IU (fc) J = JW™ = J , by Lemma 4 (b). Thus, we have that: 

- J. Also, (W^ — J)(I — J) = - 

J-JW^+J 2 = WW-J, because JW^ = J and J 2 = J. 
Using the latter, and multiplying (42) from the left by (I— J), 
we obtain: 


x (k+i) = ( W (k) _ ~(fe) 

- — (I - J) ( VF(x ik) ) © z {k) )+(I- J)e (fc) .(47) 

Pk V ) 

Taking the norm and using sub-additive and sub-multiplicative 
properties: 

||x (fc+1) || < \\W^ k) - J|| ||£ (fc) || 

+ — ||(J- J) (vF(x^)&z^) || 

Pk v > 

+ ||(/- (48) 


It is easy to see that || ( \7F(x^) © z^) || < y/NG. Hence, 
using the sub-multiplicative property of norms and the fact that 
||7 — J|| = 1, there holds: \\(I - J) ( VF(x^) © z^ ) || < 
y/NG. Similarly, from (46): ||(/ - J)e^\\ < 2 ^° G - 
Combining the latter conclusions with (48): 

p( fc+1) || < \\W {k) - J\\ P (fe) || + 3 ^ NaG , ( 49 ) 

Pmin 

Squaring the latter inequality, we obtain: 


p (fe+i) ir < 

+ 

+ 


||IU (fc) -J|| 2 p (fe) || 2 


6 ^ aG \\ W w - j|| pwii 

Plain 


' 3y/N a g' 


Palin 


(50) 


Taking expectation, using the independence of and x (k> , 

and the inequality E[|tt|] < (E[w 2 ]E 2 , for a random variable 
u, we obtain: 

E [||* (fc+1) || 2 ] < E [||W (fc) - J|| 2 ] E [||i (fc) || 2 ] 

n iwm - Ji2 ]r 
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Denote by £ <k ' ) := (E [||tE^ fe ^ || 2 ]) 1 “. Applying Lemma 4 fd), 
writing the right-hand side of (51) as a complete square, and 
taking the square root of the resulting inequality: 

g( fe +i ) <f3^) + 3VNaG . (52) 

Pmin 

Unwinding recursion (52), and using the bound 1 A /? A ... A 
/3 k < jzrg, we obtain that, for all k = 0,1,.... there holds: 

< 3a '{F < f . Squaring the last inequality, the desired result 

follows. ■ 


B. Proof of result (36) in Subsection VI-A 

Assume for simplicity that d = 1 but the proof extends to 
a generic d > 1. Let x* be a solution of (9) (which exists as 
—► K. is continuous and constraint set X is compact.) 
The update of the proposed method can be written as: 

x {k+1) = r xN j x (*9 - a + e (fc) J . 

Here, is a subgradient of 'T Q at xJ k> which equals: 

oi k) = gf + -(/ - cy fc) 

a 

= gW + ±(I-C)xW, 
a 

where = (g[ k \ ...,gffl) T , and g [ k ' is a subgradient of /,; 
at x\ k \ Also, recall e ^ from (14). Note that Lemmas 5 and 
9 continue to hold here as well, and therefore, it is easy to 
show that, for all k = 0 , 1 ,...: 

E [||ffi fc) || 2 j < G% := 2 NG 2 A 18 ^ (53) 

L J (Pmin)(l P) 

Now, a standard analysis of projected subgradient methods, 
following, e.g., [57], gives: 

|. (fe+t) _ a ,.,| 2 < \\ x (k)_ x »f 


2a 2 


i^’ii 2 


, K-l 

T7 £ 


Consider the running average xilf 1 := [ K x^ k \ Using 

convexity of 'F a , applying (53), and taking expectation: 


E [*„(*<*>) - tf Q (a;*) 


AND 2 2 VND 




k (fe+1) -X* 


— 2a — x(g^ A e^) 
+ a 2 \\gP+eW\\ 2 . 

Using \\xW -x*|| < 2 y/ND, and 

> *a(® (fc) ) + («?i fc) ) T (x* - *«), 


we further obtain: 


k =0 

|x(°) — x* || 2 AaVNDy^ 

K 1 K 2 ^ 


+ aGl + ^Y. E [!|e (fc) H 2 ]- 

k=0 

Next, note that E[||e^ fc '*j|] < \j2C es fuk, and E[||e^ fc ^|| 2 ] < 
2C e Uk , where we recall Pk = 1 — Ufc• Applying the latter 
bounds on (54), Using the facts that Uk > 1, for all k, and 
that S u := Y^k=o y/uf, we obtain: 

E[tf a (*W)-tf a (®*)] (55) 

AND 2 | 2s/2s/ND^/Cf-Su 
~ 2aK + K 

A aGq, A 2aC e . 

Applying (10) to xi k) . defining x ( r k) := jfY,iLi x i%> and 
taking expectations, it follows that: 

E (56) 

AND 2 2s/2 VND^/Cr e S u 


H - CxGiu T 2 OiC e + 
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aNG 2 

2 (1-A 2 (C))- 


Finally, using the same argument as in [18], equation (22), 
we obtain the desired result. 


H^fe+t) _ ,p .||2 < _ x »\\2 _ 2 Q (tt a (a;< fc >) 

-*a(® # )) + AaVND\\e^\\ A 2 a 2 ||^ fc) || 2 A 2a 2 ||e (fe) || 2 . 

Summing the above inequality for k = 0, ...,K — 1, dividing 
the resulting inequality by K, and using (53), we obtain: 


C. Proof of results in Subsection VI-B 


We let d = 1 for notational simplicity. Consider x (k> — b*, 
where we recall 6 * = j? ^ bi 1. Then, it is easy to show 
that, for k = 0 , 1 ,..., the following recursive equation holds: 

(x {k+1) -b*j=C (z (fe) - 6 *) A a(I - J)b, (57) 

where C = C — al. Therefore, for k = 1 , 2 ,..., we have: 

k -1 

x {k) _ b * = c k (x w -b*)+aJ2 - J)b. (58) 

i =0 

Let qi denote the *-th unit-norm eigenvector, and A, the *-th 
eigenvalue of Laplacian C, ordered in an ascending order. We 
have that Ai = 0, A 2 > 0, and qi = ^=1. Further, note that 

He'll = 1 — a. Then, there holds: 

N 

C k ~\l - J)b = 53(1 - C 0 A,: - a) k -%qj(I - J)b, 
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because qj (/ — J)b = 0. Therefore, from (58), we obtain: 

k-1 

£(*) < (l - a) fc i ? 0 + aR sp (N - 1) £(1 - a - c 0 A 2 ) fc - t , 

t=o 

which yields (39). 

Now, we consider algorithm (2). Recall quantity x (k> — 
||IE[a; ( ' fc - ) ] — 6 *||. Considering the recursive equation on E[a/ fc (] — 
b*, completely analogously to the above, we can obtain: 

X (fe) < ( l-a) k R 0 (59) 

k-1k-1 

+ aR sp (N - 1) Ena a — c 0 A 2 (1 — (5 S+1 )) . 

t—0 s—t 

We now upper bound the sum in (59). We split the sum into 
two parts: 

(fe—1)/2 fc_l 

5i = 5Z n (! - « - c 0 A 2 ( 1 - <5 S+1 )) (60) 

t—0 s=t 

k-1 k-1 

s 2 = J2 na — eoA.d-^.Chl) 

t=(k- l)/2+l s=i 


(To avoid notational complications, we consider odd k and 
k > 3.) Next, note that 

( fc — 1)/2 

51 < (1-«-coA 2 (1 

*=0 

fe ^ At X i 

5 2 < ^ (l-a-c 0 A 2 (l-<5 fc / 2 )) ” ” . 

t=(fe-l)/ 2 +l 


From the above bounds, it is easy to show that (41) follows. 

Inf 2 / '° ) , 

Now, let k := K e = y a c ’ , and a = 2fl c °^-^ 1 ^ . It is easy 
to show that, for these values, quantity in (39) is e(l + 
o(e)). We now show that quantity xf! in (40) is also e(l + 
o(e)). First, note that (1 — a) Kt = |(l + o(e)). Next, consider 
the term: 

aR sp (N- 1) aR sp (N — 1) 
coA 2 (1 - <5 fe / 2 ) + a ~ c 0 A 2 (1 - S k / 2 ) ' 

/ \ 9/2 

Note that, for 9 > 0, we have that 5 k / 2 ~ = o(l). 

Therefore, we have that: 


aR sp (N-l) 
c 0 A 2 (1-^/2) 


l(l + o(l)). 


Therefore, we must show that the remaining term: 

aR sp (N - 1) (1 - a - cq A 2 (l - 6)) {k ~ 1)/2 
Co A 2 (1 — 5) + a 

< f? sp (iV 1)(1 cr cq A 2 (l - ( 5 )) (fe_1)/2 = o(e). 


Observe that: 


(1 - a - c 0 A 2 (l - (5)) (fc 1)/2 


(2i?o 


(l+c 0 A 2 e)/2 


This term is o(e) if 9 > 1/(coA 2 ), which we assumed, and 
therefore we conclude that Xub ' n (40) is e(l + o(l)). 

As noted, the established saving of is asymptotic, i.e.. 


it holds when the required accuracy e (and hence, step-size a) 
goes to zero. We now demonstrate by simulation (where one 
sample path is run) that the saving of at least 7 ^ holds for 
finite accuracies also. We consider a N = 4-node connected 
network. We set Co = l/(2iV), and 9 = l/(co A 2 (£)). The 
quantities 6 ,’s are generated mutually independently from the 
uniform distribution on [0,5]. We compare the proposed and 
standard methods (solution estimates initialized with zero at all 
nodes) in terms of the total cost, and the number of iterations, 
needed to reach accuracy e = e(a) = EE (IE ,5 ^ or 
the a’s in the range as shown in Table 1. Figure 4 snows the 
simulated savings (dotted line) and the savings predicted by 
the (asymptotic) theory (equal to l/(2a0)). We can see that the 
simulated savings are larger, i.e., they are at least l/(2cd9), and 
we can also see that they indeed behave as 1 /a, as predicted 
by the theory. 

To further corroborate that the gains with the proposed 
method hold in non-asymptotic regimes, it is instructive to 
compare it with a naive method which stays idle for l/( 2 a 0 ) 
iterations, and then it continues as the standard distributed 
gradient method. Namely, such a method also has asymptotic 
savings in total cost (with respect to the standard method) 
of 1 /( 2 a6) (as a —> 0 ), just like the proposed method. 
However, it is clear that such savings are asymptotic only, 
i.e., they do not appear in finite time regimes. That is, with 
respect to the standard method, the naive method only “shifts” 
the error curve along iterations to the right. Table 1 shows 
total costs and iteration costs for e-accuracy with the three 
methods (proposed, standard, and naive). We can see that the 
proposed method indeed performs very differently from the 
naive method, i.e., it achieves real, non-asymptotic savings. 
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(a) Average relative error vs. total cost (all nodes). 



(b) Average relative error vs. number of iterations. 



(c) Histogram: total cost to reach rel. err. 0.01. 



standard dis. grad. 


(d) Histogram: ^iterations to reach rel. err. 0.01. 

Fig. 1: Comparison of the proposed and standard distributed 
gradient methods for the synthetic data set and a = In 
Figures (c) and (d), the arrows indicate the performance of 
the standard distributed gradient method: total cost ss 25, 500 
(Figure (c)); and number of iterations ss 507 (Figure (d)). 
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(a) Average relative error vs. total cost (all nodes). 




(a) Average cost function vs. total cost for “ala.” 
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(b) Average cost function vs. number of iterations for “ala.” 


(b) Average relative error vs. number of iterations. 
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total cost to reach e relative error x 1 q5 

(c) Histogram: total cost to reach rel. err. 0.005. 



standard dis. grad. 

(d) Histogram: total cost to reach rel. err. 0.005. 

Fig. 2: Comparison of the proposed and standard distributed 
gradient methods for a = 25 * L . In Figures (c) and (d), the 
arrows indicate the performance of the standard distributed 
gradient method: total cost ss 137,000 (Figure (c)); and 
number of iterations ss 2, 740 (Figure (d)). 



(c) Average relative error vs. number of iterations. 



Fig. 3: Figures (a) and (b): Comparison of the proposed 
and standard distributed gradient methods for data set “ala;” 
Figure (c): Effect of failures on the proposed method for the 
synthetic data set; and Figure (d): Comparison of the proposed 
method with the gossip-based scheme in [13] for the synthetic 
data set. 
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Fig. 4: Difference between the total costs of the standard 
distributed gradient method and the proposed method (in the 
log 10 scale) versus step-size a. 





